Data Analysis of Red Wine Quality by Nathaniel Wharton

Introduction

A tidy dataset of variants of the Portuguese “Vinho Verde” red wine was used for this analysis. The dataset comes from a 2009 study. It consists of physicochemical (input) variables and a single sensory (output) variable, quality. The field descriptions used throughout were taken from: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt

Univariate Plots Section

About the Data

Column Names

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

Summary of the Column Data and Data Types

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The data includes 1,599 observations and thirteen variables.
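As a minimal sketch of how the listings above can be produced, the block below rebuilds the first ten observations from the `str()` output (a subset of columns only) and calls `names()`, `str()`, and `summary()` on them. The name `wine_head` is hypothetical, and the summary values for ten rows will not match the full 1,599-row output.

```r
# First ten observations, copied from the str() output above.
# `wine_head` is a hypothetical name; the full analysis uses all 1,599 rows.
wine_head <- data.frame(
  X                = 1:10,
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

names(wine_head)    # column names
str(wine_head)      # number of rows/columns and the storage type of each column
summary(wine_head)  # min, quartiles, mean, and max for each column
```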

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Let’s look at histograms and boxplots of the columns:

Fixed.acidity is measured as the concentration (in g/dm^3) of tartaric acid. Most acids in wine fall in this category. We see a fairly normal distribution here.

Volatile.acidity is the concentration (in g/dm^3) of acetic acid in the wine. Higher levels of this can lead to “an unpleasant, vinegar taste”. Again, we see a fairly normal distribution with a bit of a long tail, and it is slightly bimodal.

Citric acid concentration is measured in g/dm^3. Apparently it “adds ‘freshness’ and flavor to wines”. The data look slightly positively skewed. There appear to be a number of observations with very low levels of citric acid, and spikes at 0.25 g/dm^3 and 0.50 g/dm^3. The outlier at 1.0 g/dm^3 gives a longer tail.

Residual sugar concentration (g/dm^3) has an early spike and a very long tail. It’s the amount of sugar remaining after fermentation stops.

Taking the log of the residual sugar concentration smooths out the distribution a bit.

Chlorides represent the amount of salt in the wine, measured as the concentration of sodium chloride (in g/dm^3). There is a spike of chlorides and a long tail of outliers.
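To illustrate why a log transform helps with a long right tail, here is a small sketch using only the ten residual sugar values printed in the `str()` output. The `skewness()` helper is a hypothetical moment-based definition written for this example, not a function from the original analysis, and the result for ten rows only hints at what happens on the full data.

```r
# Residual sugar (g/dm^3) for the first ten wines, from the str() output.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Moment-based sample skewness (hypothetical helper for this sketch).
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(residual_sugar)
skew_log <- skewness(log10(residual_sugar))
cat("skewness, raw:  ", round(skew_raw, 2), "\n")
cat("skewness, log10:", round(skew_log, 2), "\n")

# Bin counts on the log scale, computed without opening a graphics device.
log_hist <- hist(log10(residual_sugar), plot = FALSE)
print(log_hist$counts)
```

The skewness stays positive after the transform but shrinks, which matches the "smoothed out a bit" impression from the histogram.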

Transforming the value via a log, we see more of a spike with outliers than a bell curve.

A positively skewed distribution of free.sulfur.dioxide (in mg/dm^3) with a long tail is observed. Free sulfur dioxide prevents microbial growth and wine oxidation.

The total amount of sulfur dioxide includes free and bound forms of SO2 (in mg/dm^3). At concentrations of free SO2 over 50 ppm, SO2 becomes evident in the taste and nose of a wine. Here we see a positively skewed distribution.

Taking the log, we see a more-normal distribution of the total.sulfur.dioxide.

Without transformation, we see a nice bell curve for density. Density here is in (g/cm^3). Apparently the density depends a bit on the percent of alcohol and sugar content.

We see a normal distribution of wines by pH. Not being previously familiar with the pH of wine, I was surprised to see it so acidic (neutral water has a pH of 7).

Sulphates measures the concentration of potassium sulphate (in g/dm^3). It’s an additive that can contribute to SO2 gas levels, and acts as an antimicrobial and antioxidant.

Sulphates are a bit closer to normal with a log scale applied.

Alcohol (in % by volume) is positively skewed.

Alcohol maintains its positive skew even with a log transformation.

Finally we get to quality scores (on a scale of 0 to 10). Overwhelmingly, most wines received a 5 or 6. Strikingly, the extremes (0, 1, 2, 9, and 10) are not represented at all.

Univariate Analysis

Dataset Structure

The dataset for red wine is in a tidy format with separate observations of a particular wine on one row. The dataframe is wide with separate columns for each variable.

The ‘X’ column is an id stored as an integer, and quality (the “output variable”) is also stored as an integer. All other columns (the “input variables”) contain measurements stored as double-precision floating-point numbers.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in this dataset is quality. Quality is the single “output” variable that we can try to predict from the various “input” variables. Strikingly, though quality is supposed to be rated on a scale of 0-10, there are no observed values of 0, 1, 2, 9, or 10 in the dataset. Most values are in the middle (5, 6, or 7), and there are a few observations of 3, 4, and 8.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Though potentially any of the “input variables” could help us understand the quality “output variable”, the dataset notes suggest we may see positive or negative correlations between certain inputs and quality: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt

In particular, we can look at these:

  • volatile acidity: when too high, can lead to an “unpleasant, vinegar taste”.
  • citric acid: can add “‘freshness’ and flavor”.
  • total sulfur dioxide: “at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine”.

Did you create any new variables from existing variables in the dataset?

Later we see that 3 groupings were made for quality (rather than using the 0-10 scale present), but other than that, no new variables were created from the dataset (except to create a subset of data for correlations that didn’t include the observation identifier).

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Alcohol has an interesting non-normal distribution that wasn’t much affected by log transformations. Some plots were rescaled to exclude outliers, but generally few operations were performed to change the form of the data.

Bivariate Plots Section

##        cor 
## -0.3905578

We see that (as predicted) lower mean volatile.acidity is correlated with higher wine quality. This is in line with the general expectation that volatile acidity is “unpleasant” when increased. The orange line represents the mean. The blue dotted lines represent the 10%, 50% (median), and 90% quantiles. This pattern will be repeated below.
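The `cor` output above has the shape of the `estimate` element returned by `cor.test()`, so a computation along the following lines would reproduce it. This is a sketch on the ten rows visible in the `str()` output (the `wine_head` name is hypothetical), so the value it prints will differ from the full-data r of -0.39. The second part computes the per-quality mean and the 10%, 50%, and 90% quantiles that the plot lines represent.

```r
# Ten observations copied from the str() output; the full data has 1,599 rows.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Pearson correlation; $estimate is a named number labelled "cor".
r <- cor.test(wine_head$volatile.acidity, wine_head$quality)$estimate
print(r)

# Mean and 10%/50%/90% quantiles of volatile acidity at each quality score,
# i.e. the quantities behind the mean line and the dotted quantile lines.
by_quality <- aggregate(
  volatile.acidity ~ quality,
  data = wine_head,
  FUN  = function(v) c(mean = mean(v), quantile(v, c(0.1, 0.5, 0.9)))
)
print(by_quality)
```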

##       cor 
## 0.2263725

We see (as predicted) a small positive correlation between citric acid and wine quality.

##        cor 
## -0.1851003

We see a small negative correlation between total.sulfur.dioxide and quality, in line with expectations.

##       cor 
## 0.4761663

The strongest correlation for quality was between % Alcohol Content and Quality with a Pearson’s r of 0.48. The mean value of alcohol % increases with quality.

Other Relationships Explored

Since the correlation between alcohol and quality was the strongest, alcohol was examined first. Its strongest correlation with another input variable appeared to be with density.

##        cor 
## -0.4961798

A negative correlation (r = -.50) was found between percent of alcohol and density. Thus, the more alcohol, the lower the density.

##        cor 
## -0.5524957

There was a moderately strong negative correlation (r = -.55) between volatile acidity and citric acid.

##       cor 
## 0.6676665

There was a strong positive correlation (r = .67) between free and total sulfur dioxide, though this was to be expected, since total sulfur dioxide includes the free form.

##       cor 
## 0.6680473

Another strong positive correlation (r = .67) was found between density and fixed acidity.

##        cor 
## -0.6829782

A strong negative correlation (r = -.68) in the data was found between pH and fixed acidity. This makes sense, as a lower pH indicates higher acidity.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

To summarize, correlations were discovered between quality (the feature of interest) and:

  • alcohol (r = .48) (strongest positive correlation)
  • citric.acid (r = .23)
  • volatile.acidity (r = -.39) (strongest negative correlation)
  • total.sulfur.dioxide (r = -.19)

Generally each of these was expected, except for the strength of the correlation between alcohol content and quality.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Correlations were also found between:

  • alcohol and density (r = -.50)
  • volatile.acidity and citric.acid (r = -.55)
  • free.sulfur.dioxide and total.sulfur.dioxide (r = .67)
  • density and fixed.acidity (r = .67)
  • pH and fixed.acidity (r = -.68)
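Pairwise correlations like these can be read off a correlation matrix computed on a copy of the data without the `X` identifier column. The sketch below assumes base R's `cor()` and again uses only the ten rows visible in the `str()` output (hypothetical name `wine_head`), so the coefficients it prints are illustrative and will not equal the full-data values listed above.

```r
# Ten observations copied from the str() output; the full data has 1,599 rows.
wine_head <- data.frame(
  X                = 1:10,
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Drop the observation identifier: it is a row label, not a measurement.
wine_vars <- subset(wine_head, select = -X)

# Pearson correlation between every pair of remaining columns.
cor_matrix <- round(cor(wine_vars), 2)
print(cor_matrix)
```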

What was the strongest relationship you found?

The highest correlation (r = -.68) in the data was the negative correlation found between pH and Fixed Acidity. This makes sense as lower pH values are used as a measurement of higher acidity.

Multivariate Plots Section

Three variables are plotted here – analysis is below.

Rather than using the individual quality levels, here quality was plotted in distinct groups. The aim was for the groupings to “pop out” a bit more. It is possible to see that higher qualities (the 6-10 bucket) tend to appear to the lower right of the graph. The 0-4 qualities tend to appear to the upper left, and the 4-6, average qualities are found in between. The overall finding is that higher percent alcohol and lower volatile acidity tend to be associated with higher-rated wine.
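One way to build such quality groups is base R's `cut()`, which returns a factor suitable for distinct colours and per-group trend lines. The bucket labels in the text share their edges (4 and 6), so the exact boundaries are not recoverable; the break points and the group names `low`, `average`, and `high` below are assumptions made for this sketch.

```r
# The six quality scores that actually occur in the data.
quality <- c(3L, 4L, 5L, 6L, 7L, 8L)

# ASSUMED boundaries: [0, 4] -> low, (4, 6] -> average, (6, 10] -> high.
quality_group <- cut(
  quality,
  breaks         = c(0, 4, 6, 10),
  labels         = c("low", "average", "high"),
  include.lowest = TRUE
)

print(data.frame(quality, quality_group))
print(levels(quality_group))  # factor levels, in plotting order
```

Under these assumed breaks, scores 3 and 4 fall in `low`, 5 and 6 in `average`, and 7 and 8 in `high`.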

Three variables are again plotted here – analysis is below.

Here the quality bins are again used to show how citric acid and percent alcohol relate to quality. The high-quality grouping lies to the upper right, and the low-quality grouping lies to the lower left (with a high variance in this case). Thus, we see again that higher citric acid levels and higher percent alcohol are both generally correlated with higher quality ratings. Also of note is that there are many observations with no citric acid at all.

Analysis is below.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wine)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + total.sulfur.dioxide, 
##     data = wine)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + total.sulfur.dioxide + 
##     density, data = wine)
## 
## =====================================================================
##                            m1         m2         m3          m4      
## ---------------------------------------------------------------------
##   (Intercept)            1.875***   3.095***   3.305***  -16.936     
##                         (0.175)    (0.184)    (0.192)    (10.264)    
##   alcohol                0.361***   0.314***   0.302***    0.320***  
##                         (0.017)    (0.016)    (0.016)     (0.019)    
##   volatile.acidity                 -1.384***  -1.371***   -1.353***  
##                                    (0.095)    (0.095)     (0.095)    
##   total.sulfur.dioxide                        -0.002***   -0.002***  
##                                               (0.001)     (0.001)    
##   density                                                 20.103*    
##                                                          (10.192)    
## ---------------------------------------------------------------------
##   R-squared                 0.227      0.317      0.323      0.325   
##   adj. R-squared            0.226      0.316      0.322      0.323   
##   sigma                     0.710      0.668      0.665      0.664   
##   F                       468.267    370.379    253.797    191.665   
##   p                         0.000      0.000      0.000      0.000   
##   Log-likelihood        -1721.057  -1621.814  -1614.623  -1612.674   
##   Deviance                805.870    711.796    705.423    703.706   
##   AIC                    3448.114   3251.628   3239.246   3237.349   
##   BIC                    3464.245   3273.136   3266.132   3269.611   
##   N                      1599       1599       1599       1599       
## =====================================================================

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Looking at the plot of “Alcohol vs Volatile Acidity vs Quality”, it’s evident that the higher-quality wines tend to fall to the lower right of the graph and the lower-quality wines fall to the upper left. Thus higher alcohol content and lower volatile.acidity are associated with higher-quality wines.

Also plotted was “Alcohol vs Citric Acid vs Quality”, where the higher-quality wines tended toward the upper right quadrant and the lower-quality wines fell in the lower left, in line with expectations.

Finally, “Free Sulfur Dioxide vs Total Sulfur Dioxide vs Quality Rating” gives more detail on the relationship between total sulfur dioxide and free sulfur dioxide (r = .67). The plot reveals no clear relationship between them and quality, as we observe great variance in where each quality level falls. This result was not unexpected, but it was interesting to see what the plot looked like.

Were there any interesting or surprising interactions between features?

Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

As an exercise, a basic linear model was created, composed of alcohol, volatile.acidity, total.sulfur.dioxide, and density inputs. It largely did not perform very well, as its R-squared value was just 0.325. Oddly, adding citric.acid (r = .23) as an input didn’t appear to improve the results in spite of its correlation with quality. Also attempted was adding the logarithm of total.sulfur.dioxide, but it did not improve the results.
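A sketch of the nested-model comparison, under stated assumptions: it uses base R's `lm()` on the ten rows visible in the `str()` output (hypothetical name `wine_head`) and fits only the first two models, so its coefficients and R-squared values are illustrative and will not match the table. It also checks one number from the full analysis: for a single-predictor model, R-squared equals the squared Pearson correlation, and 0.4761663 squared does round to the 0.227 reported for m1.

```r
# Ten observations copied from the str() output; the full data has 1,599 rows.
wine_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Nested models: each one adds a predictor to the previous formula.
m1 <- lm(quality ~ alcohol, data = wine_head)
m2 <- lm(quality ~ alcohol + volatile.acidity, data = wine_head)

# R-squared cannot decrease when a predictor is added to a nested OLS model.
r2 <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
print(round(r2, 3))

# Consistency check on the reported full-data results:
# single-predictor R-squared equals r^2.
r_alcohol_quality <- 0.4761663
print(round(r_alcohol_quality^2, 3))  # 0.227, the R-squared reported for m1
```

Because R-squared never falls as predictors are added, the adjusted R-squared, AIC, and BIC rows of the table are the fairer guide to whether an extra input such as density earns its place.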


Final Plots and Summary

Plot One

Description One

The previous analysis of this graph will not be repeated here (see above for the previous description). The trend lines additionally show that for low-quality wines (tending to appear to the upper left), volatile acidity tends to increase with an increased percent of alcohol, while the trend is slightly reversed for average-quality wines and quite flat for high-quality wines (tending to the lower right).

Plot Two

Description Two

Adding to the previous analysis of this graph (see above), the trend lines show, for the low- and high-quality wines, a decrease in citric acid concentration with an increase in alcohol percentage. Oddly, there is little effect for the average-quality wines. The trend lines clearly distinguish the high-quality wines, with their higher citric acid, from the low-quality wines, though average-quality wines almost converge with high-quality wines at an alcohol concentration of 14%.

Plot Three

Description Three

Again, past analysis will not be revisited here, but it can be clearly seen that the slope of the relationship between free sulfur dioxide and total sulfur dioxide is consistent for all three quality trendlines. This gives further evidence that we’re seeing a relationship between dependent variables, i.e. that the two variables hold a similar relationship to each other at every quality level.


Reflection

Many of the data revelations in this study came from the correlation matrix between variables. The data plots largely confirmed that the correlations and distributions were valid, and revealed additional variance and outliers in the data. It was interesting to see that although, for example, density had a relatively high correlation with alcohol (r = -.50) and alcohol had a high correlation with quality (r = .48), density (like other influencing variables) did not have a high correlation with quality (r = -.18). The linear model did not work out as well as would have been ideal, as an R-squared of 0.325 has limited predictive utility.

Revealing plots were created for highly correlated variables, such as alcohol, volatile acidity, and quality. Adding trendlines to the graph proved fruitful, further revealing the clear distinctions among different quality levels.

While the trendlines in the graph, “Multivariate: Alcohol vs Citric Acid vs Quality” were complex, trends were made clearer by grouping qualities in the graph, “Alcohol vs Citric Acid vs Quality Rating”, a success.

I struggled for quite a while, researching how to adjust the font size of the correlation matrix to make it readable. It was surprisingly complex to get a readable plot. The ggcorr function produced a more usable heatmap/correlation matrix, though there were still some issues with the text (to the lower left).

I also struggled a bit with color palettes and using the factor() function to enable proper distinct plotting of trendlines.

Other researched items in the sources (below) indicate other areas where internet research was used to implement fixes.

As for future work, it would be useful to have a richer dataset to test with, e.g. the type of grape, the geographical location of the vineyard, the vintage, the type of cask used for storage, and the label. It would also help to know which reviewer rated each wine, since reviewer A could have different tastes than reviewer B.